ITS1 classification

Leighton Pritchard

28 September 2016

What is the goal?

Broadly speaking…

Identify and profile Phytophthora present in a nursery

  1. Sequence sample from nursery
  2. Put reads into “black box”
  3. Be told which Phytophthora spp. are present (and in what relative quantity)

More specifically…

  1. Obtain samples from nursery
  2. Extract DNA from samples
  3. Nested PCR amplification of ITS1 regions
  4. Sequence amplification product to get PE reads
  5. Put reads into “black box”
  6. Be told which Phytophthora spp. are present (and in what relative quantity)

So what’s in the box?

Given a set of reads…

  • Assemble reads into complete ITS1 regions
  • Compare to known ITS1 sequences (supervised) to identify originating species. Needs reliable database.
  • Apply clustering (unsupervised) to identify originating species/novel OTUs
  • Use abundance to estimate relative quantity of originating species
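
The abundance step can be sketched as follows. This is a minimal illustration, assuming every merged read has already been given a species label; the function name `relative_abundance` and the read counts are hypothetical, not part of the pipeline. Relative quantity is taken as each species' share of the labelled reads.

```python
from collections import Counter


def relative_abundance(assignments):
    """Return each label's share of the assigned reads (shares sum to 1)."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical per-read species assignments (illustrative data only)
reads = ["P. ramorum"] * 6 + ["P. kernoviae"] * 3 + ["P. cinnamomi"] * 1
profile = relative_abundance(reads)
```

Read share is only a proxy for the quantity of each organism in the sample.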

Clustering ITS1

Sequence survey 1

Is either of these statements true? (not accounting for primer choice…)

  • There is only one ITS1 sequence per species?
  • There is only one ITS1 sequence per isolate?

One ITS1 per species

>1 ITS1 per species

NJ tree of ITS1 regions from cocoa-associated Phytophthora spp. Appiah et al. (2004) Plant Path. doi:10.1111/j.0032-0862.2004.00980.x

One ITS1 per isolate

How many ITS1 per isolate?

  1. Identify ‘canonical’ ITS1 sequence
  2. Query (BLAST/HMMer) against sequenced genomes
  3. Count matches
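
A minimal sketch of the counting step, assuming the query was run with BLAST+ and saved in the standard 12-column tabular format (`-outfmt 6`). The identity and length thresholds, the function name `count_distinct_hits` and the example rows are assumptions for illustration, not the values used in this survey.

```python
import csv
import io


def count_distinct_hits(tabular_text, min_identity=90.0, min_length=100):
    """Count distinct subject locations among hits in BLAST tabular output."""
    locations = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        sseqid, pident, length = row[1], float(row[2]), int(row[3])
        sstart, send = int(row[8]), int(row[9])
        if pident >= min_identity and length >= min_length:
            # Order the coordinates so a minus-strand hit maps to the same locus
            locations.add((sseqid, min(sstart, send), max(sstart, send)))
    return len(locations)


# Made-up hits: two pass, one fails on identity, one fails on length
example = "\n".join([
    "ITS1\tcontig_1\t99.5\t200\t1\t0\t1\t200\t1000\t1199\t1e-100\t360",
    "ITS1\tcontig_1\t98.0\t200\t4\t0\t1\t200\t5199\t5000\t1e-95\t340",
    "ITS1\tcontig_2\t85.0\t200\t30\t0\t1\t200\t10\t209\t1e-40\t150",
    "ITS1\tcontig_3\t97.0\t50\t1\t0\t1\t50\t10\t59\t1e-10\t80",
])
n_hits = count_distinct_hits(example)
```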

Six sequenced genomes

  • Distinct BLAST hits with ITS1 query
    • P. infestans: 133
    • P. sojae: 7
    • P. cinnamomi: 2
    • P. kernoviae: 1
    • P. ramorum: 1
    • P. cambivora: 12

But…

Assembly questions…

  • ITS1/rRNA regions are repetitive
  • Assemblies collapse repetitive regions
  • Is there evidence of assembly collapse?
  • Excess read coverage over genomic ITS1 regions is evidence of assembly collapse
  • Compare coverage to ‘conserved, single-copy genes’ (BUSCO)

Six sequenced genomes

  • Ratio of ITS1 coverage to BUSCO gene coverage = estimated copy number
    • P. infestans: 97
    • P. sojae: 40
    • P. cinnamomi: 60
    • P. kernoviae: 23
    • P. ramorum: 8
    • P. cambivora: 84
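
The ratio above can be sketched as follows. This assumes per-base read depths are already available for the ITS1 region and for the single-copy (BUSCO) genes; using the median as the summary statistic is an assumption, and the depth values and the name `estimate_copy_number` are illustrative only.

```python
from statistics import median


def estimate_copy_number(its1_depths, busco_depths):
    """Ratio of median ITS1 read depth to median single-copy gene depth."""
    baseline = median(busco_depths)
    if baseline <= 0:
        raise ValueError("single-copy baseline coverage must be positive")
    return median(its1_depths) / baseline


# Made-up depths: ITS1 covered about 40 times more deeply than the baseline
its1 = [780, 820, 800, 790, 810]
busco = [18, 20, 22, 19, 21]
copies = estimate_copy_number(its1, busco)
```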

>1 ITS1 per isolate

ITS1 diversity

47 sequences from P. infestans T30-4

Estimating isolate ITS1 diversity

  • Collapsed regions not yet assembled
  • Assembly options include GRAbB, MITObim, and riboSEED
  • Without assembly, estimate SNPs/phasing/profile

Sequence survey 2

If there is more than one ITS1 sequence per isolate…

  • How diverse are ITS1 sequences?
    • Sequence identity ranges from below 80% to 100%
  • Is there a possibility of confusion in classification, or in a representative database?
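
As an illustration of what a sequence identity figure means, here is a minimal sketch that scores two sequences that have already been aligned (equal length, "-" for gaps). Definitions of identity vary; counting a gap against a base as a mismatch is an assumption here, and the sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both sequences is ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Made-up aligned pair: one substitution and one gap in ten columns
identity = percent_identity("ACGTACGTAC", "ACGTTCGT-C")
```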

Confusion?

Confusion

  • 154 ‘representative’ ITS1 sequences
  • Clustering using SWARM
  • Result: 122-138 clusters, depending on stringency
  • Potential for confusion, but so far only within a clade (sensu Martin et al.)
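
To show why the number of clusters depends on stringency, here is a toy single-linkage clustering in which two sequences are linked when they differ by at most d edits. This is not SWARM itself: it ignores the abundance information SWARM also uses and is far slower. The sequences and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def single_linkage_clusters(seqs, d=1):
    """Group sequences connected by chains of pairs at most d edits apart."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if edit_distance(seqs[i], seqs[j]) <= d:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Made-up sequences: the first three form a chain of single differences
seqs = ["ACGTACGT", "ACGTACGA", "ACGTACCA", "TTTTGGGG"]
loose = single_linkage_clusters(seqs, d=1)   # chain merges into one cluster
strict = single_linkage_clusters(seqs, d=0)  # only identical sequences merge
```

With d=1 the first and third sequences end up in the same cluster even though they differ by two edits, because each is one edit from the second.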

Algorithms

Classification

Three stages

  1. QC: FastQC to check read quality
  2. Assembly: PEAR to merge paired-end reads
  3. Clustering: SWARM to cluster reads (unsupervised) for OTUs

Flowchart

Working prototype

  • Still much to do, but groundwork in place
  • PCRMIX96 testing: 10 known species
    • 9 of the 10 species present were identified
    • 8 false positives
    • 1 false negative
  • Some issues due to ITS1 sequence ‘confusion’
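
The PCRMIX96 counts above can be turned into summary rates. This is a minimal sketch; the function name `detection_metrics` is hypothetical. Specificity and false positive rate are left out because they need a count of true negatives, which is not reported here.

```python
def detection_metrics(tp, fp, fn):
    """Rates that can be computed from true/false positives and false negatives."""
    return {
        "sensitivity": tp / (tp + fn),          # share of present species found
        "false_negative_rate": fn / (tp + fn),  # share of present species missed
        "precision": tp / (tp + fp),            # share of calls that were correct
    }


# Counts from the PCRMIX96 test: 9 identified, 8 false positives, 1 missed
pcrmix96 = detection_metrics(tp=9, fp=8, fn=1)
```

On these counts sensitivity is 9/10 and precision is 9/17, so just under half of the reported species were not in the mix.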

Validation

  • We need to establish objectively that the clustering/classification works:
    • Samples of known spp. composition to test accuracy of classification
    • False positive rate (FPR), false negative rate (FNR), specificity (Sp), sensitivity (Sn), etc.
  • We need to build a representative database of ITS1
    • genome analysis to catalogue ITS1 variation
  • Comparison against other tools, e.g. PIPITS (requires database development)
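
A minimal sketch of scoring one sample of known composition, assuming a species counts as a true negative when it is in the reference database, absent from the sample and not reported. That definition, the function name `evaluate_calls` and the species sets are assumptions for illustration.

```python
def evaluate_calls(expected, called, database):
    """Compare reported species with the known composition of a sample."""
    tp = len(expected & called)             # present and reported
    fn = len(expected - called)             # present but missed
    fp = len(called - expected)             # reported but absent
    tn = len(database - expected - called)  # absent and not reported

    def rate(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "Sn": rate(tp, tp + fn),
        "Sp": rate(tn, tn + fp),
        "FPR": rate(fp, fp + tn),
        "FNR": rate(fn, fn + tp),
    }


# Illustrative sets only; not results from this study
database = {"P. infestans", "P. sojae", "P. cinnamomi", "P. kernoviae", "P. ramorum"}
expected = {"P. infestans", "P. sojae", "P. ramorum"}
called = {"P. infestans", "P. sojae", "P. cinnamomi"}
scores = evaluate_calls(expected, called, database)
```

Because the true negative count grows with the size of the database, Sp and FPR computed this way depend on how many species the database holds.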